This plot shows only the most frequent words used across the entire corpus of tweets.
This plot shows the most frequent words used across the entire corpus of tweets, grouped by Twitter handle.
From the word frequencies one can see that the Government focuses on supportive words. Some of the news agencies most often use words referring to themselves; however, words such as “wc”, “level3lockdown” and “test” may give more insight into the topics being discussed. In the next section the words are grouped into bi-grams to build more context for the topics that could be present.
Each word is paired with its adjacent word to form two-word groups (bi-grams), and these groups are then counted.
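The bi-gram counting described above can be sketched as follows. This is a minimal Python illustration on made-up sample tweets, not the report's actual pipeline (which works on the full tweet corpus in R):

```python
from collections import Counter

# Invented sample tweets for demonstration only.
tweets = [
    "testing backlog grows",
    "lockdown briefing today",
    "testing backlog persists",
]

bigram_counts = Counter()
for tweet in tweets:
    words = tweet.lower().split()
    # Pair each word with the word that follows it within the same tweet.
    bigram_counts.update(zip(words, words[1:]))

print(bigram_counts.most_common(3))
```

Sorting the resulting counts then surfaces the most common two-word combinations, which carry more context than single-word frequencies.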
Looking at the bi-gram frequencies, one can see that News24 frequently reported on the testing backlog, eNCA reported on the lockdown and briefings, and EWN reported on foreign nationals, while the Government posted more about supportive measures.
The “ldatuning” package implements four metrics, “Griffiths2004”, “CaoJuan2009”, “Arun2010” and “Deveaud2014”, for selecting the optimal number of topics for an LDA model. The number of CPU cores to use can be specified for better performance when executing this method; the larger the dataset, the longer the calculation takes. For more information on this method and the various metrics for finding the optimal number of topics K, see: https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html or https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/
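The tuning itself was done with R's ldatuning package, but the underlying idea generalises: fit an LDA model for a range of candidate K values and record a quality score for each. A minimal sketch of that search loop in Python with scikit-learn, on an invented miniature corpus (the corpus, K range and score are illustrative assumptions, not the report's settings):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented toy corpus standing in for the tweet dataset.
docs = [
    "testing backlog at laboratories",
    "lockdown level three briefing",
    "testing backlog delays results",
    "president briefing on lockdown",
]
dtm = CountVectorizer().fit_transform(docs)

# Fit one LDA model per candidate K and record an approximate
# log-likelihood score (higher is better for this particular metric).
scores = {}
for k in range(2, 6):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(dtm)
    scores[k] = lda.score(dtm)

print(scores)
```

The ldatuning metrics referenced above play the same role as the score here: each maps a fitted model to a number that can be plotted against K.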
Looking at the results of this plot, one can see that the metrics “Griffiths2004”, “Arun2010” and “Deveaud2014” are not informative for this particular dataset. To find the optimal number of topics K, one looks for an “elbow”: a point where the plot changes abruptly. According to the “CaoJuan2009” metric, the optimal number of topics within the tweet dataset therefore lies between 4 and 10.
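The elbow heuristic can be made concrete: choose the K at which the metric curve bends most sharply, i.e. where the absolute second difference of the curve is largest. A small sketch with invented metric values (not the actual CaoJuan2009 numbers from the plot):

```python
# Hypothetical metric-vs-K curve; lower is better, as with CaoJuan2009.
ks = [2, 4, 6, 8, 10, 12, 14]
metric = [0.90, 0.55, 0.40, 0.36, 0.34, 0.33, 0.32]

# Second differences measure how abruptly the slope changes at each
# interior point of the curve.
second_diff = [metric[i - 1] - 2 * metric[i] + metric[i + 1]
               for i in range(1, len(metric) - 1)]

# The elbow is the K with the largest absolute second difference.
elbow_k = ks[1 + max(range(len(second_diff)), key=lambda i: abs(second_diff[i]))]
print(elbow_k)
```

In practice the elbow is usually judged visually from the plot, as in the report; the second-difference rule is just one way to formalise that judgement.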
Rajkumar Arun, V. Suresh, C. E. Veni Madhavan, and M. N. Narasimha Murthy. 2010. On finding the natural number of topics with latent Dirichlet allocation: Some observations. In Advances in Knowledge Discovery and Data Mining, Mohammed J. Zaki, Jeffrey Xu Yu, Balaraman Ravindran and Vikram Pudi (eds.). Springer Berlin Heidelberg, 391–402. http://doi.org/10.1007/978-3-642-13657-3_43
Cao Juan, Xia Tian, Li Jintao, Zhang Yongdong, and Tang Sheng. 2009. A density-based method for adaptive LDA model selection. Neurocomputing (16th European Symposium on Artificial Neural Networks 2008) 72, 7–9: 1775–1781. http://doi.org/10.1016/j.neucom.2008.06.011
Romain Deveaud, Éric SanJuan, and Patrice Bellot. 2014. Accurate and effective latent concept modeling for ad hoc information retrieval. Document numérique 17, 1: 61–84. http://doi.org/10.3166/dn.17.1.61-84
Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101, suppl 1: 5228–5235. http://doi.org/10.1073/pnas.0307752101
Martin Ponweiser. 2012. Latent Dirichlet Allocation in R. Retrieved from http://epub.wu.ac.at/id/eprint/3558
This topic model was built using LDA (“Latent Dirichlet Allocation”) with a K parameter of 8, chosen using the method above for identifying the optimal K value. The model’s “beta” matrix is used to examine per-topic-per-word probabilities.
Looking at the topic model for all the tweets, one can identify the following eight topic areas during the period 19 May to 18 June 2020:
This topic model was built using LDA (“Latent Dirichlet Allocation”) with a K parameter of 4. The model’s “beta” matrix is used to examine per-topic-per-word probabilities.
Looking at Media24’s Topic Model one can identify the following 4 topic areas during the period of 19 May to 18 June 2020:
This topic model was built using LDA (“Latent Dirichlet Allocation”) with a K parameter of 4. The model’s “beta” matrix is used to examine per-topic-per-word probabilities.
Looking at EWNupdates’s Topic Model one can identify the following 4 topic areas during the period of 19 May to 18 June 2020:
This topic model was built using LDA (“Latent Dirichlet Allocation”) with a K parameter of 4. The model’s “beta” matrix is used to examine per-topic-per-word probabilities.
Looking at eNCA’s Topic Model one can identify the following 4 topic areas during the period of 19 May to 18 June 2020:
This topic model was built using LDA (“Latent Dirichlet Allocation”) with a K parameter of 6. SABC News had the most tweets overall and is the only media house given a 6-topic model, as the other media houses yielded mixed results. The model’s “beta” matrix is used to examine per-topic-per-word probabilities.
Looking at the SABC News Topic Model one can identify the following 6 topic areas during the period of 19 May to 18 June 2020:
This topic model was built using LDA (“Latent Dirichlet Allocation”) with a K parameter of 4. The model’s “beta” matrix is used to examine per-topic-per-word probabilities.
Looking at GovernmentZA’s Topic Model one can identify the following 3 topic areas during the period of 19 May to 18 June 2020:
Topic number 3 remains somewhat ambiguous.